Introduction

Motivation

The subway is a growing means of transportation that can move passengers safely and efficiently. However, as ridership increases, congestion has become one of the main factors lowering the efficiency of subway operations.

I use the subway a lot, both in China and in the U.S. Every time I was stuck in a crowd moving slowly toward the subway gates, or squeezed into a completely packed subway car, I thought, “I would have kept shopping and waited until the station was less congested, had I known how packed it was.”

Congestion prediction, therefore, would be a great way to maximize convenience and comfort for metro users. Since New York’s MTA subway ranks among the busiest subway systems in the world by annual ridership, and the MTA publishes extensive turnstile data, I chose to base my analysis on the MTA subway in New York.

Objective

This project includes three main parts:

  1. Prediction Model

         - Build an OLS regression model for forecasting the net entries of metro users every 4 hours at each MTA subway station in NYC.

  2. Congestion Score Computation

         - Define the congestion level of each station based on prediction results. Other factors to consider include number of turnstiles, number of subway lines running at the station, number of entrances, etc.

  3. Shiny Web Application

         - Develop a web-based user interface to inform metro users about the predicted future congestion level at each MTA station at the time they specify.

Methodology

In this section, I explain the methods I use to build the regression model, compute the congestion score and create the web application.

Regression Model

For the regression analysis, I prepare a dataset containing the net entries of metro users per station in July and August 2017. Net entries is the dependent variable in the OLS regression model. The raw turnstile usage dataset I use to prepare the dependent variable is provided by the New York State open data platform, and one big limitation of that dataset is that entries and exits are not recorded hourly but every 4 hours.

To predict the dependent variable - net entries, I consider five categories of predictors, including time variables, spatial variables, census variables, weather variables and trips of other transportation methods.

For the time variables, I consider factors like day of the week, day of the month, rush hour and holidays. Furthermore, I speculate that net entries are auto-correlated in time: the net entries at a station in hour n-1 may influence the net entries at that station in hour n. In other words, net entries at a subway station in a certain hour may be associated with net entries at that station in the previous hour, so I also include lagged net entries as a time variable. Another predictor I include in the model is the count of public events, since public events drive people to go out and the subway is one of the major transportation choices for them.

For spatial variables, I calculate the distance between each subway station and its nearest public facilities such as bus stops, schools, plaza and malls, offices, hospitals, parking lots, as well as other subway stations.

For census variables, I choose census block group demographics from U.S. Census Bureau like population density, total housing units and race. I also include ESRI demographic metrics like businesses and percentage of development.

Moreover, people tend to use the subway more on a sunny day than on a rainy day, and on a warm day than on a chilly day. Hence, I include weather data as predictors in the regression model.

Lastly, other transportation modes provide people with alternatives for transit and may affect net entries. Thus, the model also includes taxi trip counts and bike trip counts as independent variables for predicting net entries every 4 hours at each station. Due to limited data availability, I do not include bus trip counts.

Here is the outline of the OLS regression model:

Dependent variable:

Net Entries of Metro Users Every 4 Hours at Each Station in NYC (NetEntries)

Predictors:

  1. Time Variables

         - Time lag (lag_NEntries)

         - Day of the week (Weekday)

         - Day of the month (Day)

         - Rush hour or not (Rush_Hour)

         - Holiday or not (Holiday)

         - Event count (Event_Count & Event_Count2)

  2. Spatial Variables

         - Distance to hospital (d_hospital)

         - Distance to college (d_college)

         - Distance to nearest subway stations (d_subway)

         - Distance to parks (d_parks)

         - Distance to CBDs (d_cbd)

         - Distance to bus stops (d_busstop)

         - Distance to public schools (d_school)

         - Distance to plaza and malls (d_plazamalls)

         - Distance to recreation facilities (d_recreation)

         - Distance to parking lots (d_parkinglot)

         - Distance to offices (d_office)

  3. Census Variables

         - Total Housing Unit (TOTHU_CY)

         - Population Density (POPDENS_CY)

         - Median Household Income (MEDHINC_CY)

         - Businesses (S01_BUS)

         - Percentage of Development (HISPPOP_CY)

         - Race (WHITE_CY, BLACK_CY, AMERIND_CY, ASIAN_CY, PACIFIC_CY, OTHRACE_CY)

  4. Weather Variables

         - Temperature (TEMP_C)

         - Humidity (HUMIDITY)

         - Pressure (PRESSURE)

         - Wind speed (WIND_SPEED)

         - Wind direction (WIND_DIR)

         - Weather description (WEATHER)

  5. Trips Variables

         - Taxi trips (TaxiTrips)

         - Bike trips (BikeTrips)

Before running the OLS regression with the predictors mentioned above, I conduct some exploratory analysis to better understand the temporal and spatial patterns of subway net entries and to help decide which variables to use in the final model.

The OLS regression analysis includes three parts: in-sample regression, out-of-sample regression and cross-validation. I first run the regression on the full dataset (in-sample). Then I randomly select 75% of the net-entries observations as the training set and the remaining 25% as the test set for the out-of-sample regression. Finally, I validate the model with a 20-fold cross-validation, which shows how generalizable the model is.
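As a rough sketch of this three-part workflow (the formula below is abbreviated and purely illustrative; the full model uses all predictors listed above, and `myvars` is the assembled variable dataset built later in this document):

```r
library(caret)

# In-sample: fit on the full dataset
fit.in <- lm(NetEntries ~ lag_NEntries + Weekday + Rush_Hour + TEMP_C, data = myvars)

# Out-of-sample: random 75/25 train/test split
set.seed(123)
inTrain  <- createDataPartition(myvars$NetEntries, p = 0.75, list = FALSE)
training <- myvars[inTrain, ]
testing  <- myvars[-inTrain, ]
fit.out  <- lm(NetEntries ~ lag_NEntries + Weekday + Rush_Hour + TEMP_C, data = training)
pred     <- predict(fit.out, testing)
mae      <- mean(abs(pred - testing$NetEntries), na.rm = TRUE)

# 20-fold cross-validation to check generalizability
ctrl   <- trainControl(method = "cv", number = 20)
cv.fit <- train(NetEntries ~ lag_NEntries + Weekday + Rush_Hour + TEMP_C,
                data = na.omit(myvars), method = "lm", trControl = ctrl)
```

caret's createDataPartition and trainControl handle the random split and the 20-fold setup, respectively.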

In the next section, Installation, I explain in detail how I conduct both the exploratory and regression analysis to build the final OLS model.

Congestion Score Computation

The main result of the finished OLS regression model is predicted net entries every 4 hours at each MTA subway station in NYC in July and August 2017. This prediction result is used as the main factor to compute the congestion score.

Other factors I consider when computing the congestion score include weekday or weekend, rush hour or not, number of turnstiles, number of subway lines running at each station, and number of station entrances. I perform additional calculations to obtain the following 8 metrics for the congestion score computation: Weekday vs. Weekend (s_weekday), Rush hour vs. Not rush hour (s_rushhr), Predicted net entries (s_pred), Net entries per turnstile (s_entTrnst), Net entries per entrance (s_entEnt), Net entries per subway line (s_entLine), Turnstiles per entrance (s_trnstEnt) and Subway lines per entrance (s_lineEnt). Each metric is divided into 5 quantile breaks and assigned a score from 1 to 5, with 5 meaning most likely to be congested and 1 meaning least likely. I then rank these 8 metrics from most important to least important and give each a weight from 5 (most important) to 1 (least important). Finally, I compute the congestion score using the following equation:

    Congestion Score = 5 × s_entLine + 4 × s_weekday + 3 × (s_entTrnst + s_entEnt) + 2 × (s_trnstEnt + s_lineEnt + s_rushhr) + s_pred

Finally, I divide the computed congestion scores into 5 categories: the bottom 10% get a congestion level of 1, 10%–40% a level of 2, 40%–60% a level of 3, 60%–90% a level of 4 and the top 10% a level of 5. When the congestion level is 5, metro users can safely consider the station congested at the time they specify, and may want to find an alternative way of transit or choose a different time to take the subway.
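The scoring, weighting and binning described above can be sketched as follows. This is only a sketch: `scores` is a hypothetical data frame holding the 8 raw metrics per station and time period, and the column names follow the text.

```r
library(dplyr)

scores <- scores %>%
  # 5 quantile breaks per metric: 1 = least likely congested, 5 = most likely
  mutate_at(vars(s_entLine, s_weekday, s_entTrnst, s_entEnt,
                 s_trnstEnt, s_lineEnt, s_rushhr, s_pred),
            funs(ntile(., 5))) %>%
  # Weighted sum per the equation above
  mutate(CongestionScore = 5 * s_entLine + 4 * s_weekday +
           3 * (s_entTrnst + s_entEnt) +
           2 * (s_trnstEnt + s_lineEnt + s_rushhr) + s_pred,
         # Map score percentiles to congestion levels 1-5
         pct = percent_rank(CongestionScore),
         CongestionLevel = case_when(pct <= 0.10 ~ 1,
                                     pct <= 0.40 ~ 2,
                                     pct <= 0.60 ~ 3,
                                     pct <= 0.90 ~ 4,
                                     TRUE        ~ 5))
```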

Shiny Web Application

The interactive web application is created with the Shiny package in R. The application allows metro users to check how congested an MTA subway station will be at a specified time.

The dataset behind the web app contains the predicted net entries and congestion score for all MTA stations in July and August 2017. The web app serves as a prototype for building a more practical congestion forecast web app in the future.

Please click here to access my published MTA subway congestion forecast web app. It may take several seconds to load since the dataset has over 100,000 rows. A more detailed explanation of how to use the web app is provided in a later section.
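For illustration, a minimal Shiny app of this kind might look like the sketch below. The `congestion` data frame and its columns are assumptions here; the published app includes an interactive map and more controls.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("MTA Subway Congestion Forecast"),
  selectInput("station", "Station:", choices = unique(congestion$Station)),
  dateInput("date", "Date:"),
  # Net entries are recorded every 4 hours, so offer those hours only
  selectInput("hour", "Hour:", choices = c(0, 4, 8, 12, 16, 20)),
  textOutput("level")
)

server <- function(input, output) {
  output$level <- renderText({
    row <- subset(congestion, Station == input$station &
                    Date == input$date & Hour == input$hour)
    paste("Predicted congestion level:", row$CongestionLevel[1])
  })
}

shinyApp(ui, server)
```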

Installation

In this section, a detailed step-by-step installation procedure is presented to show how I conduct my analysis.

Step 1: Data Collection

The first step in building this OLS regression model is data gathering. Below is a list of the raw datasets I collected, along with where each comes from and what information it contains.

1. Turnstile Usage Data 2017

This CSV dataset is downloaded from the New York State open data platform. It contains cumulative entry and exit counts for each turnstile unit at each MTA station. The image below shows what the raw dataset looks like.

2. MTA Subway Entrance and Exit Data

This CSV dataset is also downloaded from New York State open data platform. The data contains all entrances and exits information for MTA stations, as well as which subway lines are running at each station.

3. Weather Data

The weather data is obtained from Kaggle, posted by user Selfish Gene. It contains 5 years of hourly data for various weather attributes, such as temperature, humidity and air pressure, for 30 US and Canadian cities (including New York) as well as 6 Israeli cities. The data was acquired via the OpenWeatherMap weather API and is available under the ODbL License.

4. School Locations

The shapefile is obtained from NYC Open Data. The data contains school point locations based on the official address.

5. Bus Stop Locations

The shapefile is obtained from The William and Anita Newman Library. The data contains bus stop locations in NYC.

6. Park Locations

The shapefile is downloaded from NYC Open Data. The data contains a polygon layer displaying all parks in NYC.

7. Facilities Database

The shapefile is also downloaded from NYC Open Data. The data contains information about 35,000+ public and private facilities and program sites that are owned, operated, funded, licensed or certified by a City, State, or Federal agency in the City of New York.

8. NYC Census Data

I obtain census data in census block groups level from ArcGIS Online. After logging in, I locate New York, select “Analysis” and “Layer Enrichment”, and add population density, total housing units, median household income and other needed census variables in USA census block group level to the map layer. Then I export these data as a shapefile for importing into R.

9. NYC Permitted Event Information

The CSV file is again obtained from NYC Open Data. This dataset lists all approved event applications going back to 2013.

10. Citi Bike Trips Data

The CSVs are obtained from Citi Bike System Data. The data contains information like bike trip start and end time, station, as well as trip duration. I only download bike trips data in July and August 2017.

11. Taxi Trips Data

The yellow and green taxi trips data are downloaded from NYC Open Data. I only download taxi trips data in July and August 2017. The data contains taxi trips pick up time and location.

Step 2: Library, Map Theme and DV Data

Before starting this step, I create a new R script, set the working directory, and install and load the following packages.

library(corrplot)
library(caret) 
library(AppliedPredictiveModeling)
library(stargazer)
library(ggmap)
library(tidyverse)
library(sf)
library(FNN)
library(data.table)
library(car)
library(spdep)
library(lubridate)
library(rgdal)
library(sp)
library(dplyr)

Then I set map theme and plot themes for future plotting of maps and plots.

myTheme <- function() {
  theme_void() + 
    theme(
      text = element_text(size = 7),
      plot.title = element_text(size = 20, color = "#FEFFEF", hjust = 0.5, vjust = 0, face = "bold"), 
      plot.subtitle = element_text(size = 16, color = "#FEFFEF", hjust = 0.5, vjust = 0),
      axis.ticks = element_blank(),
      plot.background = element_rect(fill = "black", color="white"),
      panel.border = element_rect(colour = "grey", fill=NA, size=0),
      plot.margin = margin(0.5, 0.5, 0.5, 0.5, 'cm'),
      legend.key.height = unit(0.4, "cm"), legend.key.width = unit(0.8, "cm"),
      legend.title = element_text(size = 14, color = "#FEFFEF", hjust = 0.5, vjust = 0, face = "bold"),
      legend.text = element_text(size = 12, color = "#FEFFEF", hjust = 0.5, vjust = 0),
      legend.direction = "horizontal", 
      legend.position = "bottom"
    )
}

plotTheme <- theme(legend.position="none",
                   text = element_text(size = 12),
                   axis.text=element_text(size=12, color = "#BEBEBE", face="bold"),
                   plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "in"),
                   plot.title = element_text(colour="white", size=18, face ="bold"),
                   plot.subtitle = element_text(colour="white", size=14),
                   axis.title.x=element_text(size=12, color = "white", face="bold"), 
                   axis.title.y=element_text(size=12, color = "white", face="bold"),
                   panel.background = element_rect(fill = "black"),
                   plot.background = element_rect(fill = 'black'))

plotTheme2 <- theme(panel.grid.major = element_line(colour = "#626262"),
                    panel.grid.minor = element_line(colour = "#626262"),
                    legend.position="bottom",
                    legend.direction = "horizontal",
                    text = element_text(size = 12),
                    axis.text=element_text(size=12, color = "#BEBEBE", face="bold"),
                    plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "in"),
                    plot.title = element_text(colour="white", size=18, face ="bold",hjust = 0.5),
                    plot.subtitle = element_text(colour="white", size=14),
                    axis.title.x=element_text(size=12, color = "white", face="bold"), 
                    axis.title.y=element_text(size=12, color = "white", face="bold"),
                    panel.background = element_rect(fill = "black"),
                    plot.background = element_rect(fill = 'black'),
                    legend.background = element_rect(fill = 'black'),
                    legend.box.background = element_rect(fill = 'black'),
                    legend.text = element_text(color = "white", face="bold"))

After setting up the libraries and themes, the most important initial task is to prepare the dependent variable data. As mentioned above, I need a dataset containing net entries every 4 hours at each MTA station in NYC in July and August 2017.

To prepare this dataset, I first load the 2017 turnstile usage data and filter it to keep only July and August.

# Import turnstile data - filter by July and August 2017
turnstile <- read.csv("Turnstile_Usage_Data__2017.csv")
# Extract year and month for filtering, also day and weekday
time.date <- mdy(turnstile$Date)
turnstile.modified <-
  turnstile %>%
  mutate(Year = as.factor(year(time.date)), 
         Month = as.factor(month(time.date)), 
         Day = as.factor(day(time.date)), 
         Weekday = as.factor(wday(time.date)),
         Station = as.factor(Station))
# Filter by year
turnstile2017 <- filter(turnstile.modified, Year == 2017)
# Filter by month
turnstileJulAug <- filter(turnstile2017, Month == 7 | Month == 8)
# Add hour to the data frame
time <- hms(turnstileJulAug$Time)
turnstileJulAug <- turnstileJulAug %>%
  mutate(Hour = as.factor(hour(time)))

After loading and filtering the raw turnstile usage data, I conduct data wrangling and cleaning. First, a concatenated column named “serial” is created as a unique identifier for each turnstile unit. Recall that the turnstile data records cumulative entries and exits every 4 hours. However, different turnstiles update their counts in different hours: some update in hours 1, 5, 9, 13, 17 and 21, while others update in hours 0, 4, 8, 12, 16 and 20. For the sake of the regression later, I remove all rows updated in hours other than these twelve. Besides, the recorded entries and exits are the raw readings on the turnstile counters rather than per-period net counts, so it is necessary to take the difference between each counter reading and the previous one.

# DATA WRANGLING #
# A concatenated column named "serial" is created as a unique identifier for each turnstile unit.
turnstileJulAug$serial<-paste(turnstileJulAug$C.A,turnstileJulAug$Unit,turnstileJulAug$SCP,sep='')
# Each unit is sorted by serial, date, time
turnstileJulAug<-arrange(turnstileJulAug,serial,Date,Time)
# Some record time in 1-5-9-13-17-21, some in 0-4-8-12-16-20
# Don't need other hour's data
turnstileJulAug.filter <- filter(turnstileJulAug, Hour != 2 & Hour != 3 & Hour != 6 & Hour != 7 & Hour != 10 & Hour != 11 & Hour != 14 & 
                                   Hour != 15 & Hour != 18 & Hour != 19 & Hour != 22 & Hour != 23)
# The fields ENTRIES and EXITS do not show the entries and exits.  
# These are the actual numbers on the turnstile counters.  
# To get actual entries and exits, the difference between the counter at t and t-1 must be taken.  
netcalc <- turnstileJulAug.filter %>%
  group_by(serial) %>%  
  mutate(Net_Entries= Entries - lag(Entries))
netcalc <- netcalc %>%
  group_by(serial) %>%  
  mutate(Net_Exits= Exits - lag(Exits))

As for data cleaning, I first deal with negative entries, negative exits and NAs at each turnstile unit. I create a data frame containing all problematic rows and find that quite a few turnstiles have many issues. I delete all turnstile units with more than 100 errors.

# DATA CLEANING #
# Take a closer look at the units with negative entries, negative exits, or NA 
suspect_data <- filter(netcalc, Net_Entries<0 | Net_Exits<0 | is.na(Net_Exits) | is.na(Net_Entries))
suspect_data_summary <- suspect_data %>%
  group_by(serial) %>%
  summarise(error_count = n())

# The summary of the suspect data shows that on average, the machines having issues show up about 5 times.  
# However, the summary shows that there is one machine that has 382 error occurrences.  
summary(suspect_data_summary)
# Actually, there are quite a few turnstiles that have many issues.  
bad_unit <- arrange(suspect_data_summary,desc(error_count))

# Delete all units with error count > 100 times
bad_unit <- filter(bad_unit, error_count > 100) %>%
  mutate(bad = 1) %>%
  select(-error_count)
netcalc1 <- left_join(netcalc, bad_unit)
netcalc1[is.na(netcalc1)] <- 0
filtered_unit <- filter(netcalc1, bad == 0) %>%
  select(-bad)

The next step is to join the filtered and cleaned turnstile dataset with the station data containing the location of each MTA station. One big challenge I encounter is that the station file uses a different station naming system than the turnstile usage data. To overcome this, I first import the subway station entrance and exit data and perform the necessary manipulation; for example, I convert all station names to upper case because the station names in the turnstile usage data are upper case. Then I fix the remaining station names manually in Excel. Lastly, I join the turnstile usage data to the modified station data frame containing the location information.

# Import subway station entrance and exit data
subwayStation <- read.csv("NYC_Transit_Subway_Entrance_And_Exit_Data.csv")
MTA <- data.frame(Station = subwayStation$Station.Name,
                  Longitude = subwayStation$Station.Longitude,
                  Latitude = subwayStation$Station.Latitude)
MTA = MTA[!duplicated(MTA$Station),]
# Change all values to upper case text for join later
MTAStation <- mutate_all(MTA, funs(toupper)) %>%
  mutate(Longitude = as.numeric(MTA$Longitude),
         Latitude = as.numeric(MTA$Latitude))
# Modify station name for join later
MTAStation.modified <- MTAStation %>% 
  mutate(Station = gsub('TH', '', Station)) %>%
  mutate(Station = gsub('RD', '', Station)) %>%
  mutate(Station = gsub('ND', '', Station))

# Match stations
# The station names are very messy - had to write to csv to clean the data
# Get list of stations in turnstile data
turnstileStation <- data.frame(Station = filtered_unit$Station,
                               Division = filtered_unit$Division)
turnstileStation = turnstileStation[!duplicated(turnstileStation$Station),]
write.csv(MTAStation.modified, file="station.csv")
write.csv(turnstileStation, file="turnstileStation.csv")

# After modification, write the station file back into R
MTAStation.modified2 <- read.csv("station.csv")
# Make sure the station has turnstile data in both July and August
MTAStation2 <- MTAStation.modified2 %>%
  filter(Station %in% turnstileStation$Station)

# Finally ready to join
turnstileJoinStation <- left_join(filtered_unit, MTAStation2)
# Some stations have no lat and lon, remove them
MyData <- na.omit(turnstileJoinStation, cols=c("Longitude", "Latitude"))

The last step is to aggregate the turnstile usage data. I now have a dataset containing net entries per turnstile unit, but I need one with net entries per MTA station. This is easily accomplished with group_by and summarise.

# Aggregate data
MyData_Aggregate <-
  MyData %>%
  group_by(Station, Date, Hour) %>%
  summarise(NetEntries = sum(Net_Entries),
            NetExits = sum(Net_Exits)) %>%
  as.data.frame()
# Merge data frames to get lat and lon
FinalAggregate <- left_join(MyData_Aggregate, MTAStation2)

Step 3: Variables Data Preparation

Then I move on to adding the predictor variables to the dataset.

I first add the weather data. I import the raw weather data and extract the hourly observations for New York in July and August. The filtered weather data is then joined to the aggregated net entries dataset.

### Variable 1: weather, including precipitation and temperature ###
# Import raw weather data downloaded from the site
humidity <- read.csv("weather_raw_data/humidity.csv")
pressure <- read.csv("weather_raw_data/pressure.csv")
temperature <- read.csv("weather_raw_data/temperature.csv")
weather_description <- read.csv("weather_raw_data/weather_description.csv")
wind_speed <- read.csv("weather_raw_data/wind_speed.csv")
wind_direction <- read.csv("weather_raw_data/wind_direction.csv")
# Only need New York data
humidityNY <- data.frame(TIME = humidity$datetime,
                         HUMIDITY = humidity$New.York)
pressureNY <- data.frame(TIME = pressure$datetime,
                         PRESSURE = pressure$New.York)
temperatureNY <- data.frame(TIME = temperature$datetime,
                            TEMP = temperature$New.York)
weather_descriptionNY <- data.frame(TIME = weather_description$datetime,
                                    WEATHER = weather_description$New.York)
wind_speedNY <- data.frame(TIME = wind_speed$datetime,
                           WIND_SPEED = wind_speed$New.York)
wind_directionNY <- data.frame(TIME = wind_direction$datetime,
                               WIND_DIR = wind_direction$New.York)
# Combine all weather data
weather.data <- data.frame()
weather.data <- left_join(temperatureNY, humidityNY)
weather.data <- left_join(weather.data, pressureNY)
weather.data <- left_join(weather.data, wind_speedNY)
weather.data <- left_join(weather.data, wind_directionNY)
weather.data <- left_join(weather.data, weather_descriptionNY)

# Filter weather data by year and month, recall that we only need July and August 2017's data
# First separate date and hour, and change the date separator from "-" to "/"
weather.data <- weather.data %>% 
  mutate(DATE_TIME = weather.data$TIME) %>%
  separate(col = TIME, into = c('DATE', 'TIME'), sep = ' ') %>%
  mutate(DATE = gsub('-', '/', DATE))
weather.data$DATE <- as.factor(weather.data$DATE)
weather.data$TIME <- as.factor(weather.data$TIME)
weather.time <- ymd_hms(weather.data$DATE_TIME)
weather.data <-
  weather.data %>%
  mutate(Year = as.factor(year(weather.time)), 
         Month = as.factor(month(weather.time)), 
         Day = as.factor(day(weather.time)),
         Hour = as.factor(hour(weather.time)))
weather.data2017 <- filter(weather.data, Year == 2017)
weather.data78 <- filter(weather.data2017, Month == 7 | Month == 8)

# Join weather data to FinalAggregate
joinweather <- left_join(FinalAggregate, weather.data78, by = c('Month'='Month', 'Day'='Day', 'Hour' = 'Hour')) %>%
  select(-DATE, -TIME, -DATE_TIME, -Year)

# Note: temperature is in Kelvin, the relationship between Kelvin and Celsius is [K] - 273.15 = [C]
# Convert temperature to celcius for easier interpretation later
joinweather$TEMP_C = joinweather$TEMP - 273.15

Note that the original temperature is in Kelvin, so I convert it to Celsius in R. Below is an exploratory analysis graph evaluating the correlation between net entries and temperature. The plot shows a general trend: as temperature goes up, net entries increase.
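A plot along these lines can be generated with ggplot2 (the aesthetics here are illustrative; the report's figures use the custom themes defined earlier):

```r
library(ggplot2)

ggplot(joinweather, aes(x = TEMP_C, y = NetEntries)) +
  geom_point(alpha = 0.1, color = "#FEFFEF") +          # one point per station-period
  geom_smooth(method = "lm", se = FALSE) +              # linear trend line
  labs(title = "Net Entries vs. Temperature",
       x = "Temperature (°C)", y = "Net Entries per 4 Hours")
```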

For the spatial variables, I prepare a total of 11 distance factors (distance from each MTA station to bus stops, schools, the CBD, parks, nearby subway stations, colleges, hospitals, offices, parking lots, recreation sites, and plazas and malls) using the FNN package's get.knnx() function. Sample code for calculating these proximity factors is attached below; the same code structure is used to calculate the other 10 distance variables. All distance variables are stored in the data frame data.

### Variable 2: proximity variables ###
MTAStation3 <- st_as_sf(MTAStation2, coords = c("Longitude", "Latitude"), crs = 4326)
data <- MTAStation2
# Import location data
school <- st_read("Public_Schools_Points_2011-2012A.shp")
busstop <- st_read("stops_mn_bus_may2016.shp")
cbd <- st_read("cbd_centroid.shp")
parks <- st_read("large_park.shp")
# Import facility data and filter out needed facility locations
facility <- st_read("NYCfacility.shp")
college <- filter(facility, facsubgrp == "Colleges or Universities")
hospital <- filter(facility, facsubgrp == "Hospitals and Clinics")
office <- filter(facility, facsubgrp == "Offices")
parkinglot <- filter(facility, facsubgrp == "Parking Lots and Garages")
recreation <- filter(facility, facsubgrp == "Recreation and Waterfront Sites")
plazamalls <- filter(facility, facsubgrp == "Streetscapes, Plazas, and Malls")

# 1: distance to public schools
school <- school %>%
  select(geometry) %>%
  st_transform(4326) %>%
  mutate(landUse = "school")
Station <- MTAStation3 %>%
  select(geometry) %>%
  st_transform(4326) %>%
  mutate(landUse = "Station")
allPlaces <- rbind(school, Station)
allPlaces <- cbind(as.data.frame(st_coordinates(allPlaces)), data.frame(allPlaces))
schoolXY <-
  allPlaces %>%
  filter(landUse == "school") %>%
  select(X,Y) %>%
  as.matrix() 
StationXY <-
  allPlaces %>%
  filter(landUse == "Station") %>%
  select(X,Y) %>%
  as.matrix()   
nn = get.knnx(schoolXY,StationXY,k=5)
data <-
  as.data.frame(nn$nn.dist) %>%
  rownames_to_column(var = "Station") %>%
  gather(school, school_Distance, V1:V5) %>%
  arrange(as.numeric(Station)) %>%
  group_by(Station) %>%
  summarize(d_school = mean(school_Distance)) %>%
  arrange(as.numeric(Station)) %>% 
  select(-Station) %>%
  bind_cols(data)

Below is a map showing the calculated distance to school variable at all MTA subway stations in NYC.

Next, I import the census data and join it to the data frame data that holds all the distance variables prepared previously. Note that I write additional code (not shown) to deal with NAs in the joined dataset.

### Variable 3: census variables ###
census <- st_read("NYCcensus.shp")
census <- st_transform(census, st_crs(MTAStation3))

data_sf <- st_as_sf(data, coords = c("Longitude", "Latitude"), crs = 4326)
joincensus <-st_join(data_sf,census)

The following map shows the percentage of development across New York.

The code below shows how I compute time lagged net entries and add it to my variable dataset.

### Variable 4: time lag ###
# Get previous hour time lag value
jointimelag <- joinweather
timelag <- dplyr::lag(jointimelag$Hour, n = 1)
lag_NEntries <- dplyr::lag(jointimelag$NetEntries, n = 1)
lag_NExits <- dplyr::lag(jointimelag$NetExits, n = 1)
jointimelag <- data.frame(cbind(jointimelag, timelag, lag_NEntries, lag_NExits))
jointimelag <- subset(jointimelag, select = -timelag)

The following code shows how I add other time variables to my dataset, including day of the week and whether the hour falls within peak traffic hours.

### Variable 5: day of the week ###
timevars <- jointimelag
time.vars <- mdy(timevars$Date)
timevars <- timevars %>%
  mutate(Weekday = wday(time.vars))

### Variable 6: rush hour or not ###
# "Peak Fares are charged during business rush hours, on any weekday train scheduled 
# to arrive in NYC terminals between 6 and 10 AM or depart NYC terminals between 4 and 8 PM."
# Rush hour in this case: 8,9,16,17,20
timevars$Rush_Hour <- "0"
timevars <- timevars %>%
  mutate(Rush_Hour = ifelse(timevars$Hour==8|timevars$Hour==9|timevars$Hour==16|
                              timevars$Hour==17|timevars$Hour==20, "1", "0"))

I also count the number of events per hour in July and August 2017. If there is no event in a certain hour, I replace NA with zero.

### Variable 7: Count of event ###
# Get event calendar in July and Aug 2017
event_holiday <- read.csv("NYC_Permitted_Event_Information_-_Historical.csv")
time.event = mdy_hms(event_holiday$Start.Date.Time)
time.event.end = mdy_hms(event_holiday$End.Date.Time)
event_holiday <- event_holiday %>%
  mutate(Year = year(time.event),
         Month = month(time.event),
         Day = day(time.event),
         Hour = hour(time.event))
# Count number of event per hour
event_holiday2017 <- filter(event_holiday, Year == 2017)
event_holiday78 <- filter(event_holiday2017 , Month == 7|Month == 8)
count_event <- event_holiday78 %>%
  group_by(Month, Day, Hour) %>%
  summarise(Event_Count = n())
joinevent <- left_join(timevars, count_event, by=c("Month","Day","Hour"))
joinevent$Event_Count <- replace(joinevent$Event_Count, is.na(joinevent$Event_Count), 0)

Lastly, I add bike trips and taxi trips as variables.

### Variable 8: bike trips ###
addbiketrips <- joinevent
biketrip7 <- read.csv("JC-201707-citibike-tripdata.csv")
biketrip8 <- read.csv("JC-201708 citibike-tripdata.csv")
biketrips <- rbind(biketrip7,biketrip8)
time.bike <- ymd_hms(biketrips$starttime)
biketrips <- biketrips %>%
  mutate(Month = as.character(month(time.bike)),
         Day = as.character(day(time.bike)),
         Hour = as.character(hour(time.bike)))
count_biketrips <- biketrips %>%
  group_by(Month, Day, Hour) %>%
  summarise(BikeTrips = n())
addbiketrips <- left_join(addbiketrips, count_biketrips, by=c("Month","Day","Hour"))

### Variable 9: taxi trips ###
addtaxitrips <- addbiketrips
# Green taxi
taxi_g <- read.csv("2017_Green_Taxi_Trip_Data.csv")
time.taxi <- mdy_hms(taxi_g$Start_Time)
taxitrips <- taxi_g %>%
  mutate(Year = as.character(year(time.taxi)),
         Month = as.character(month(time.taxi)),
         Day = as.character(day(time.taxi)),
         Hour = as.character(hour(time.taxi)))
count_taxitrips <- taxitrips %>%
  group_by(Month, Day, Hour) %>%
  summarise(TaxiTrips = n())
addtaxitrips <- left_join(addtaxitrips, count_taxitrips, by=c("Month","Day","Hour"))

# Yellow taxi (already filtered by July and Aug 2017)
addtaxitrips2 <- addtaxitrips
taxi_y <- read.csv("2017_Yellow_Taxi_Trip_Data.csv")
time.taxi2 <- mdy_hms(taxi_y$tpep_pickup_datetime)
taxitrips2 <- taxi_y %>%
  mutate(Year = as.character(year(time.taxi2)),
         Month = as.character(month(time.taxi2)),
         Day = as.character(day(time.taxi2)),
         Hour = as.character(hour(time.taxi2)))
count_taxitrips2 <- taxitrips2 %>%
  group_by(Month, Day, Hour) %>%
  summarise(TaxiTrips2 = n())
addtaxitrips2 <- left_join(addtaxitrips2, count_taxitrips2, by=c("Month","Day","Hour"))

Below is a scatter plot visualizing how subway net entries relate to hourly taxi trips. From the scatter plot we can conclude that subway net entries and hourly taxi trips are positively correlated.
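For reference, that positive association can also be checked numerically with `cor()`. The sketch below uses simulated stand-in data so it runs on its own; the real call would be on the joined data frame, e.g. `cor(addtaxitrips2$NetEntries, addtaxitrips2$TaxiTrips)`.

```r
# Simulated stand-in: net entries loosely tracking hourly taxi trips
set.seed(1)
taxi <- rpois(500, lambda = 400)
entries <- 2 * taxi + rnorm(500, sd = 40)
cor(entries, taxi)  # clearly positive on this toy data
```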

Finally, we are ready to combine all variables for running the regression by joining addtaxitrips2 and joincensus.

myvars <- left_join(addtaxitrips2, joincensus, by = "Station")

Now myvars includes the following variables.

names(myvars)
##  [1] "Station"      "Date"         "Hour"         "NetEntries"  
##  [5] "NetExits"     "Month"        "Day"          "TEMP"        
##  [9] "HUMIDITY"     "PRESSURE"     "WIND_SPEED"   "WIND_DIR"    
## [13] "WEATHER"      "TEMP_C"       "lag_NEntries" "lag_NExits"  
## [17] "Weekday"      "Rush_Hour"    "Holiday"      "Event_Count" 
## [21] "Event_Count2" "BikeTrips"    "TaxiTrips"    "TaxiTrips2"  
## [25] "d_plazamalls" "d_recreation" "d_parkinglot" "d_office"    
## [29] "d_hospital"   "d_college"    "d_subway"     "d_parks"     
## [33] "d_cbd"        "d_busstop"    "d_school"     "TOTPOP_CY"   
## [37] "TOTHH_CY"     "POPDENS_CY"   "MEDHINC_CY"   "TOTHU_CY"    
## [41] "OWNER_CY"     "RENTER_CY"    "S01_BUS"      "WHITE_CY"    
## [45] "BLACK_CY"     "AMERIND_CY"   "ASIAN_CY"     "PACIFIC_CY"  
## [49] "OTHRACE_CY"   "HISPPOP_CY"   "NLCDDevPt"    "lat"         
## [53] "lng"          "geometry"

Step 4: Running the Regression Model

The dependent variable, net entries at each station every 4 hours, contains many negative values, zeros and implausibly high numbers. I remove the rows with negative or implausibly high values so they do not distort the model. For zeros, I keep only stations with fewer than 10 zero net-entries values.

### Deal with problematic data
# Remove Net Entries < 0 and Net Exits < 0
filter_vars <- filter(myvars, NetEntries > -1 & NetExits > -1)
# Remove weirdly high number of net entries
filter_vars <- filter(filter_vars, NetEntries < 500000)

# Check how many zeros I have
zeros <- filter(filter_vars, NetEntries == 0)
# Which station has most zeros?
count_zero <- zeros %>%
  group_by(Station) %>%
  summarise(zeros = n()) %>%
  arrange(-zeros)
# Only keep stations with less than 10 zero Net Entries values
keep <- filter(count_zero, zeros < 10) # 262
keep <- as.data.frame(keep) %>%
  select(-zeros, -geometry)
filter_vars <- left_join(keep, filter_vars) # 105620

One last thing before running our model: checking if my selected variables are inter-correlated. I plot a correlation matrix to investigate the correlation between all numeric variables.

bizCor <- filter_vars[,c(9:11, 14:15,20:35,38:40,43,51)]
M <- cor(bizCor)
corrplot(M, method = "number")

Based on the correlation matrix plot, I delete d_hospital, d_cbd and d_college to avoid multi-collinearity. I also combine the two taxi trips variables.

filter_vars2 <- filter_vars %>%
  mutate(TaxiTrips = TaxiTrips + TaxiTrips2) %>%
  select(-d_hospital, -d_cbd, -d_college, -TaxiTrips2)

filter_vars2 is the final prepared dataset for running the OLS regression. I run a total of 10 regression models by modifying the independent variables and choose the final model based on adjusted R squared. The following table summarises the candidate models and their adjusted R squared.

Model_NO  Description                                         Adj_R_Squared
1         Only Station and Date                               0.3824838
2         Station, Date, Hour                                 0.4527448
3         Add weather                                         0.4532201
4         Add census                                          0.4527448
5         Add time lag                                        0.4528256
6         Add trips and events                                0.4588212
7         Include all prepared variables                      0.7158453
8         All prepared variables excluding Station and Hour   0.6982653
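The selection step itself is mechanical: fit each candidate formula and keep the one with the highest adjusted R squared. A minimal sketch on toy data (the real formulas use the columns of filter_vars2):

```r
# Toy data where y depends on both predictors
set.seed(2)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- toy$x1 + 0.5 * toy$x2 + rnorm(100, sd = 0.3)

# Candidate models, scored by adjusted R squared
formulas <- list(y ~ x1, y ~ x2, y ~ x1 + x2)
adj_r2 <- sapply(formulas, function(f) summary(lm(f, data = toy))$adj.r.squared)
which.max(adj_r2)  # the full model wins here
```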

The code below shows how I run my final OLS regression model.

reg <- lm(NetEntries ~ ., data = as.data.frame(filter_vars2) %>% 
               select(-lat,-lng, -geometry, -Station, -Hour))

Step 5: Out-of-Sample Regression and Cross Validation

This step will be explained later in the Results section together with my regression model results.

Step 6: Congestion Score and Congestion Level

First of all, I add predicted net entries to filter_vars2 since it will help me determine the congestion level of each subway station at a specified time.

vars.predict <- filter_vars2 %>% 
  mutate(pred = reg$fitted.values)

To compute the congestion score, I also count the number of turnstiles, entrances and subway lines running at each station, using the NYC Transit Subway Entrance and Exit CSV (MTAStation in the code chunk below).

# Factor 1: Number of subway lines running at the station
MTAStationEntrances <- MTAStation
MTAStation <- MTAStation[!duplicated(MTAStation$Station.Name),]
wideformat <- MTAStation %>%
  select(Station.Name, Route1, Route2, Route3, Route4, Route5, Route6, Route7, Route8, Route9, Route10, Route11)
tallformat <- gather(wideformat, route, line, Route1:Route11, factor_key=TRUE)
tallformat <- na.omit(tallformat)
tallformat <- tallformat[!apply(tallformat, 1, function(x) any(x=="")),] 
count_line <- tallformat %>%
  group_by(Station.Name) %>%
  summarise(N_Lines = n())

# Factor 2: Number of entrances
count_entrance <- MTAStationEntrances %>%
  group_by(Station.Name) %>%
  summarise(N_Entrances = n())

# Factor 3: Number of turnstiles
allturnstile <- turnstileJulAug[!duplicated(turnstileJulAug$serial),]
count_turnstiles <- allturnstile %>%
  group_by(Station) %>%
  summarise(N_Turnstiles = n())

Then I join these three counts with the data frame containing predicted net entries - vars.predict.

# Join factor 1 & 2
countDF <- left_join(count_line, count_entrance)
# Join filtered MTA stations with factor 3
countDF2 <- left_join(MTAStation2, count_turnstiles)
# Combine the line/entrance counts and turnstile counts into one data frame
finalcount <- left_join(countDF2, countDF)
# Join these counts with predicted vars data
joincount <- left_join(vars.predict, finalcount, by = "Station")
congestionCalc <- joincount[, c(1:3, 6:7, 17:18, 48:51, 54:56)]

As mentioned above in the Methodology section, I use net entries and the three counts to calculate Net entries per turnstile (s_entTrnst), Net entries per entrance (s_entEnt), Net entries per subway line (s_entLine), Turnstiles per entrance (s_trnstEnt) and Subway lines per entrance (s_lineEnt). Together with another three factors, Weekday vs. Weekend (s_weekday), Rush hour vs. Not rush hour (s_rushhr) and Predicted net entries (s_pred), these 8 metrics are reclassified and weighted to compute the final congestion score, as shown in the code below.
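The derivation of the ratio columns used in the recode calls below (EntriesPerT, EntriesPerL, EntriePerE, TrnstPerE, LinesPerE) is not shown above; presumably they come from dividing predicted entries by the three counts. A base-R sketch on toy data:

```r
# Toy stand-in for congestionCalc with the inputs needed for the ratios
demo <- data.frame(pred = c(1200, 300),
                   N_Turnstiles = c(6, 2),
                   N_Lines = c(3, 1),
                   N_Entrances = c(4, 2))
demo <- transform(demo,
                  EntriesPerT = pred / N_Turnstiles,   # net entries per turnstile
                  EntriesPerL = pred / N_Lines,        # net entries per line
                  EntriePerE  = pred / N_Entrances,    # net entries per entrance
                  TrnstPerE   = N_Turnstiles / N_Entrances,
                  LinesPerE   = N_Lines / N_Entrances)
demo$EntriesPerT  # 200 150
```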

### Recode
# From 1 to 5, not congested to congested
# Factor 1: Weekday vs. Weekend (6,7,1 - score 5, other - score 1)
congestionCalc$s_weekday<-recode(congestionCalc$Weekday,"1=5;6:7=5;else=1")
# Factor 2: Rush hour vs. Not rush hour (1 - score 5, 0 - score 1)
congestionCalc$s_rushhr<-recode(congestionCalc$Rush_Hour,"1=5;else=1")
# Factor 3: Predicted net entries
quantile(congestionCalc$pred, c(0, .2, .4, .6, .8, 1))
congestionCalc$s_pred<-recode(congestionCalc$pred,
                              "-945:384.1818=1; 384.1818:918.1196=2;
                              918.1196:1517.5255=3; 1517.5255:2686.9403=4;
                              2686.9403:332742=5")
# Factor 4: Net entries per turnstile
quantile(congestionCalc$EntriesPerT, c(0, .2, .4, .6, .8, 1))
congestionCalc$s_entTrnst<-recode(congestionCalc$EntriesPerT,
                              "-300:36.34934=1; 36.34934:95.08506=2;
                              95.08506:165.33101=3; 165.33101:269.55590=4;
                              269.55590:15789=5")
# Factor 5: Net entries per line
congestionCalc$EntriesPerL[is.na(congestionCalc$EntriesPerL)] <- 0
quantile(congestionCalc$EntriesPerL, c(0, .2, .4, .6, .8, 1))
congestionCalc$s_entLine<-recode(congestionCalc$EntriesPerL,
                                  "-945:215.3992=1; 215.3992:552.4522=2;
                                  552.4522:957.3827=3; 957.3827:1660.4129=4;
                                  1660.4129:157887=5")
# Factor 6: Net entries per entrance
congestionCalc$EntriePerE[is.na(congestionCalc$EntriePerE)] <- 0
quantile(congestionCalc$EntriePerE, c(0, .2, .4, .6, .8, 1))
congestionCalc$s_entEnt<-recode(congestionCalc$EntriePerE,
                                 "-899:74.18413=1; 74.18413:191.43871=2;
                                 191.43871:341.49409=3; 341.49409:612.21923=4;
                                 612.21923:39472=5")
# Factor 7: Turnstiles per entrance
congestionCalc$TrnstPerE[is.na(congestionCalc$TrnstPerE)] <- 0
quantile(congestionCalc$TrnstPerE, c(0, .2, .4, .6, .8, 1))
congestionCalc$s_trnstEnt<-recode(congestionCalc$TrnstPerE,
                                "0:1.5=1; 1.5:1.833333=2;
                                1.833333:2.4=3; 2.4:3.166667=4;
                                3.166667:27=5")
# Factor 8: Lines per entrance
congestionCalc$LinesPerE[is.na(congestionCalc$LinesPerE)] <- 0
quantile(congestionCalc$LinesPerE, c(0, .2, .4, .6, .8, 1))
congestionCalc$s_lineEnt<-recode(congestionCalc$LinesPerE,
                                  "0:0.2=1; 0.2:0.25=2;
                                  0.25:0.4705882=3; 0.4705882:0.75=4;
                                  0.75:5=5")
# Weight these 8 factors and calculate the final congestion score
# Rank of importance: 
# s_entLine - 5
# s_weekday - 4 
# s_entTrnst & s_entEnt - 3
# s_trnstEnt & s_lineEnt & s_rushhr - 2
# s_pred - 1
congestionCalc <- congestionCalc %>%
  mutate(CongestionScore = 5*s_entLine + 4*s_weekday + 3*(s_entTrnst+s_entEnt) +
           2*(s_trnstEnt+s_lineEnt+s_rushhr) + s_pred)
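Given these weights, the lowest possible score (every factor recoded to 1) is 22 and the highest (every factor at 5) is 110, which is why the congestion-level breaks below run from 22 to 110. A quick check:

```r
# Factor weights from the ranking above
weights <- c(s_entLine = 5, s_weekday = 4, s_entTrnst = 3, s_entEnt = 3,
             s_trnstEnt = 2, s_lineEnt = 2, s_rushhr = 2, s_pred = 1)
sum(weights * 1)  # minimum possible score: 22
sum(weights * 5)  # maximum possible score: 110
```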

quantile(congestionCalc$CongestionScore, c(0, .1, .4, .6, .9, 1))
congestionCalc$alert<-recode(congestionCalc$CongestionScore,
                                 "22:37='Congestion Level 1'; 37:60='Congestion Level 2';
                                 60:70='Congestion Level 3'; 70:87='Congestion Level 4';
                                 87:110='Congestion Level 5'")

Step 7: Building Shiny App

The last step is using the dataframe containing computed congestion scores to create an interactive Shiny web application. Please check out my server.R and ui.R code to see how I build the Shiny web app in R. Again, my web app has been published and can be accessed here.

Results

In this section, I will first present the results of the exploratory analysis, the OLS in-sample regression and the out-of-sample regression, and then demonstrate how I validate the model and how users can interact with the MTA congestion forecast web application.

Exploratory Analysis Results

My goal for exploratory data analysis is to better understand the distribution of the dependent variable - net entries every 4 hours at each MTA station.

Firstly, I look at average ridership for each day of the week. My assumption is that ridership would be lower on weekdays and higher on weekends. The following code shows how I aggregate net entries and calculate average total ridership from Monday to Sunday across all stations. The resulting plot matches my expectation that average ridership is higher on the weekends.

Day_of_Week_Filtered <- filter_vars2 %>%
  arrange(NetEntries)
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday== 7] <- "Monday")
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday == 1] <- "Tuesday")
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday == 2] <- "Wednesday")
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday == 3] <- "Thursday")
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday == 4] <- "Friday")
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday == 5] <- "Saturday")
Day_of_Week_Filtered <- within(Day_of_Week_Filtered, Weekday[Weekday == 6] <- "Sunday")

Day_of_Week_Filtered <- Day_of_Week_Filtered %>%
  group_by(Date, Weekday) %>%
  summarise(entries = sum(NetEntries)) %>%
  ungroup() %>%
  group_by(Weekday) %>%
  mutate(total_days = n(),
         avg_entries_per_day = sum(entries) / mean(total_days)) %>%
  select(Weekday, total_days, avg_entries_per_day) %>%
  distinct(Weekday, total_days, avg_entries_per_day)

ggplot(Day_of_Week_Filtered, aes(Weekday, avg_entries_per_day, fill = Weekday)) + 
  geom_bar(stat = "identity") + 
  scale_fill_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 45)) +
  scale_y_continuous(labels=scales::comma) + 
  scale_x_discrete(limits=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))+
  guides(fill = FALSE) + 
  xlab("") + ylab("Average Entries per Day") + 
  ggtitle("Average Ridership \nEach Day of the Week") +
  plotTheme

Secondly, I take a look at average ridership for each 4-hour period in a day. This time, the result plots do not match my assumptions exactly. I expected the aggregated net entries to reach one peak before work and another after work, but according to the results, the peaks are at around noon and around 8 PM.

Day_of_Week_hour <- filter_vars2 %>%
  group_by(Hour) %>%
  summarise(total_entries = sum(NetEntries),
            n = n(),
            normalized = total_entries / n)

Day_of_Week_hour2 <- filter(Day_of_Week_hour, Hour==0|Hour==4|Hour==8|Hour==12|Hour==16|Hour==20)

ggplot(Day_of_Week_hour2, aes(x = Hour, y = normalized,
                             fill = Hour)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(labels=scales::comma) +
  scale_x_discrete(limits=c("0","4","8","12","16","20"))+
  scale_fill_brewer(palette = 'Set2') +
  guides(fill = FALSE) +
  xlab("") + ylab("Average Entries") + 
  ggtitle("Average Ridership for each 4-hour Time Period") +
  plotTheme

Thirdly, I combine hour and day and show them on the same plot. Again, the result plot matches my expectation that the general hourly trend is very similar from Monday to Sunday.

# Day_of_Week_Filtered was already aggregated above, so rebuild the
# weekday-labeled hourly data from filter_vars2 first
wk_labels <- c("7" = "Monday", "1" = "Tuesday", "2" = "Wednesday", "3" = "Thursday",
               "4" = "Friday", "5" = "Saturday", "6" = "Sunday")
Day_of_Week_And_Hour <- filter_vars2 %>%
  mutate(Weekday = wk_labels[as.character(Weekday)]) %>%
  group_by(Weekday, Hour) %>%
  summarise(total_entries = sum(NetEntries), 
            n = n(),
            normalized = total_entries / n)

Day_of_Week_And_Hour$Weekday = factor(Day_of_Week_And_Hour$Weekday, 
                levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
Day_of_Week_And_Hour2 <- filter(Day_of_Week_And_Hour, Hour==0|Hour==4|Hour==8|Hour==12|Hour==16|Hour==20)

ggplot(Day_of_Week_And_Hour2, aes(Hour, normalized, fill = Hour)) +
  geom_bar(stat = 'identity') +
  facet_grid(Weekday ~ .) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_discrete(limits=c("0","4","8","12","16","20"))+
  scale_fill_brewer(palette = 'Set2') +
  theme(legend.position = 'None',
        panel.grid.major = element_blank(),
        panel.grid.minor = element_line(colour = "#626262")) +
  xlab("") + ylab("Average Entries") +
  plotTheme

Lastly, it would be interesting to see the total net entries per station, and as expected some subway stations are much more popular than others. I only show the top 50 stations based on total net entries in the plot below.

stations <- filter_vars2 %>%
  group_by(Station) %>%
  summarise(total_entries = sum(NetEntries)) %>%
  arrange(-total_entries)

ggplot(stations[1:50,], aes(x = reorder(Station, -total_entries), total_entries)) +
  geom_bar(stat = 'identity', fill = '#cbc9e2', color = 'black') +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = scales::comma) + 
  xlab("Station Name") + ylab("Total Entries") +
  ggtitle("Total Entries for each Station - Top 50 Stations") +
  theme(legend.position = 'None',
        panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

In-Sample Regression Results

I first look at the regression result. The R squared is 0.6986 and the adjusted R squared is 0.6983. In other words, the regression model explains about 70% of the variance in the dependent variable, net entries per station every 4 hours.

summary(reg)$r.squared
## [1] 0.6985912
summary(reg)$adj.r.squared
## [1] 0.6982653

I also plot predicted net entries as a function of observed net entries. From the plot we can see the trend line has a slope close to 0.75, meaning that predicted net entries tend to be lower than observed ones. Therefore, my regression model should be improved later to better predict the dependent variable.
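The slope reading can be reproduced by regressing predicted values on observed ones. A toy sketch where predictions systematically under-shoot by a factor of 0.75 (simulated data, not the actual model output):

```r
set.seed(3)
observed <- runif(200, min = 0, max = 1000)
predicted <- 0.75 * observed + rnorm(200, sd = 20)  # built-in under-prediction
coef(lm(predicted ~ observed))["observed"]          # slope close to 0.75
```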

regFullset <-
  data.frame(observed = filter_vars2$NetEntries,
             predicted = reg$fitted.values,
             error = filter_vars2$NetEntries - reg$fitted.values)

#Predicted net entries as a function of observed net entries
ggplot(data = regFullset %>% filter(predicted < 200000 & observed <200000), aes(x = observed, y = predicted )) +
  geom_point(color="#F7DC6F", size = 1, shape = 1) +
  labs(title="PREDICTED NET ENTRIES AS A FUNCTION OF OBSERVED NET ENTRIES",
       subtitle="Full Dataset Prediction Result",
       x="Observed",
       y="Predicted") +   
  geom_smooth(stat = 'smooth',method = lm, color = "#F39C12") +
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

To visualize the impact of the selected predictors, I create a bar chart comparing the standardized coefficients of each predictor using the QuantPsyc library.

library(QuantPsyc)
standardized <- as.data.frame(lm.beta(reg))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized
standardized2 <- 
  standardized[-c(1,9:25,27), ] 

standardized2$absCoef <- abs(standardized2$std_coefficient)

ggplot(standardized2, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) + 
  geom_bar(stat="identity") +
  labs(title="STANDARDIZED OLS REGRESSION COEFFICIENTS",
       x="Variable",
       y="Absolute Standardized Coefficients") + 
  scale_x_discrete(limits = rev(levels(standardized$variable)))+
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 12),
        axis.title=element_text(size = 14), 
        plot.title = element_text(hjust = 0.5, size = 16),
        panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) + 
  guides(fill=guide_legend(title="Variable Name")) +
  plotTheme

From the standardized coefficient graph above, we can see that a number of distance factors and some time variables, such as the event count and the rush-hour indicator, have large impacts on net entries every 4 hours.

After checking the significance levels of the selected predictors, I plot additional time series graphs and maps to help visualize the prediction results.

The following two graphs compare predicted vs. actual net entries every 4 hours for all stations from day 1 to day 31 in July and August. Below is sample code showing how I create each individual plot in a for loop in R. These two time series graphs show that my model captures the general temporal trend of net entries over a day in both July and August.

timeplot_July <- filter(vars.predict, Month == 7)

for (i in 1:31){
  calc_July <- timeplot_July  %>% 
    filter(timeplot_July$Day == i) %>%
    group_by(Hour) %>%
    arrange(Hour) %>%
    summarise(entries = sum(NetEntries),entries_pred = sum(pred))
  
  title <- paste0("July - ","Day ", i)
  plot_July <- ggplot(data = calc_July, aes(Hour),colour=variable) + 
    geom_line(aes(y = entries, colour = "Observed",group = 1), linetype = 'dotted',size=2) + 
    geom_line(aes(y = entries_pred, colour = "Predicted",  group = 1), size=2) +
    labs(colour='', 
         title=title,
         x="Hour",
         y="Sum of Net Entries at all Stations") + 
    scale_y_continuous(labels = scales::comma) +
    plotTheme2
  
  imagename_July <- paste0("plot_July", i, ".jpg") 
  ggsave(filename=imagename_July, plot=last_plot(), width = 11, height = 7)
  
  }

The red dotted lines are observed net entries while the blue lines are predicted values.

July: Alt Text

August: Alt Text

After looking at the prediction power by hour for each day of the month, I create another two maps showing the prediction power by station in July and August. Before plotting the maps, I first create a base map. Below is sample code showing how I create a percent error map visualizing how well my model predicts spatially in July 2017; the same code is used for the August 2017 map.

# Set up basemap
baseMap <- get_map(location = c(lon = -73.9402338, lat = 40.7422241),
                   zoom = 11, 
                   maptype= 'toner-lite')

percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...))
}

errormap1 <- vars.predict  %>% 
  filter(vars.predict$Month == 7) %>%
  group_by(Station) %>%
  summarise(entries = sum(NetEntries),entries_pred = sum(pred)) %>%
  mutate(percent_error = abs(entries_pred - entries) / entries) %>%
  mutate(percentage = as.numeric(percent(percent_error)))

stations = vars.predict[!duplicated(vars.predict$Station),]
errormap1 <- left_join(errormap1, stations)

ggmap(baseMap) + 
  geom_point(data = errormap1, 
             aes(x=lng, y=lat, color=factor(ntile(percentage,5))), 
             size = 4) + 
  labs(title="Percent Error - July 2017\n") +
  scale_colour_manual(values = c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494"),
                      labels=as.character(quantile(errormap1$percentage,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="% Error (Quantile Breaks)") +
  myTheme()

The percent error maps above indicate that my model predicts better spatially for the Manhattan area, as lighter blue dots are more common there than in other areas of NYC. I therefore conduct an additional analysis: apply the same model to only the stations in Manhattan and compare the regression results with the original ones.

To compare the original regression applied to all MTA stations with the new one applied only to Manhattan stations, in addition to R squared I also calculate Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percent Error (MAPE) for both models. Below is the comparison table.

dataset Adjusted_RSquared RMSE MAE MAPE
All Stations 0.6982653 2279.927 1067.107 1.856853
Manhattan Stations 0.6941281 3505.811 1920.684 2.233543

The comparison results show that although my model seems to predict better spatially for Manhattan stations, it still performs better overall when applied to all MTA stations, with a higher adjusted R squared and lower RMSE, MAE and MAPE.
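For reference, the three error metrics in the comparison are computed in the standard way; a minimal sketch with toy numbers (the computation on the actual model output appears in the out-of-sample section):

```r
# Standard goodness-of-fit helpers
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))
mape <- function(pred, obs) mean(abs(pred - obs) / obs)

# Toy check
obs  <- c(100, 200, 400)
pred <- c(110, 180, 420)
rmse(pred, obs)  # 17.32
mae(pred, obs)   # 16.67
mape(pred, obs)  # 0.0833
```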

Out-of-Sample Regression Results

For the out-of-sample regression, I randomly select 75% of the net entries observations as the training set and use the remaining 25% as the test set.

# Generate randomly selected training set and test set
inTrain <- createDataPartition(
  y = filter_vars2$NetEntries, 
  p = .75, list = FALSE)
training <- filter_vars2[ inTrain,] #the training set
test <- filter_vars2[-inTrain,]  #the test set

I then fit the same model specification to both the training and test datasets.

#Training set regression
regTrain <- lm(NetEntries ~ ., data=training %>% select(-lat,-lng, -geometry, -Station, -Hour))
#Test set regression
regTest <- lm(NetEntries ~ ., data=test %>% select(-lat,-lng, -geometry, -Station, -Hour))

The following code shows how I compute goodness-of-fit metrics (RMSE, MAE and MAPE) for the training dataset. I use the same code to obtain the metrics for the test set.

#Training set goodness of fit metrics and indicators
regTrainingSet <-
  data.frame(observed = training$NetEntries,
             predicted = regTrain$fitted.values)
regTrainingSet <-
  regTrainingSet %>%
  mutate(error = predicted - observed) %>%
  mutate(absError = abs(predicted - observed)) %>%
  mutate(percentAbsError = abs(predicted - observed) / observed)

rsquared_training <- summary(regTrain)$adj.r.squared
rmse_training <- (mean((regTrainingSet$error)^2))^(1/2)
mae_training <- MAE(regTrainingSet$predicted, regTrainingSet$observed)
mape_training <- mean(regTrainingSet$percentAbsError)

Below is the summary table of R-squared, RMSE, MAE and MAPE for the randomly selected training set and test set regression results.

dataset Adjusted_RSquared RMSE MAE MAPE
training 0.6886289 2300.843 1071.941 1.859193
test 0.7266025 2207.692 1057.125 1.901589

In addition, plotting the model residuals is a useful way to further explore the model fit. I plot the distribution of residuals, residuals against predictions, predictions against hourly entries, and residuals in sequential order per hour per station for both the training and test datasets. The code below shows how I generate these plots for the training dataset.

p1 <- ggplot() + geom_histogram(aes(regTrain$residuals), binwidth = 200, fill="darkorange") +
  xlab('Residuals') + ylab('Frequencies') +
  ggtitle('Histogram of Residuals - Training Set') +
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

p2 <- ggplot(data = regTrainingSet %>% filter(predicted < 200000 & error < 50000), 
             aes(x = predicted, y = error)) +
  geom_point(color="darkorange", size = 1, shape = 1) +
  geom_hline(yintercept = 0, linetype = 'dotted', size = 2, color = 'red') + 
  xlab('Predictions') + ylab('Residuals') + 
  ggtitle('Residuals against Predictions - Training Set')+
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

p3 <- ggplot(data = regTrainingSet %>% filter(predicted < 200000 & observed < 200000), 
             aes(x = predicted, y = observed)) +
  geom_point(color="darkorange", size = 1, shape = 1) +
  geom_abline(slope = 1, linetype = 'dotted', color = 'red', size = 2) + 
  xlab("Predictions") + ylab("Actual Hourly Entries") + 
  ggtitle("Predictions against Hourly Entries - Training Set")+
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

p4 <- ggplot() + geom_point(aes(1:nrow(training), regTrain$residuals), data = training,
                            color="darkorange", size = 1, shape = 1) +
  geom_hline(yintercept = 0, linetype = 'dotted', size = 2, color = 'red') + 
  xlab('Index') + ylab('Residuals') + 
  ggtitle('Residuals Per Hour Per Station in Sequential Order\n - Training Set')+
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)

I use the same code to create the residuals plots for test dataset.

Lastly, I use my training dataset regression results to predict hourly net entries for the test dataset.

# Check how well our training set model predicts the test set
regPred <- predict(regTrain, test)

predictPlot <- data.frame(observed = test$NetEntries,
                          predicted = regPred)

ggplot(data = predictPlot %>% filter(predicted < 200000 & observed <200000), aes(x = observed, y = predicted)) +
  geom_point(color="#58D68D", size = 1, shape = 1) +
  labs(title="PREDICTED NET ENTRIES VS. OBSERVED NET ENTRIES",
       subtitle="Test Dataset Prediction Result",
       x="Observed",
       y="Predicted") + 
  geom_smooth(stat = 'smooth',method = lm, color = "#044389") +
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

The plot also shows an under-prediction for net entries.

Model Validation

I conduct cross-validation as another way to investigate the model quality. It enables us to see how generalizable the goodness of fit of the model is.

The cross-validation process is as follows: the subway net entries data is randomly divided into 20 folds of approximately equal size; one fold is held out as the test set while the other 19 folds serve as the training set. The procedure repeats 20 times until each fold has been used as the test set once. Finally, goodness-of-fit metrics are obtained for each of the 20 test folds.

fitControl <- trainControl(method = "cv", number = 20)
set.seed(825)

varsCV <- filter_vars2 %>%
  select(-lat,-lng, -geometry, -Station, -Hour)

lmFit <- train(NetEntries ~ ., data = varsCV, 
               method = "lm", 
               trControl = fitControl)

Here is the cross validation result.

RMSE Rsquared MAE Resample
2273.643 0.7269903 1079.216 Fold01
2112.663 0.6257141 1032.059 Fold02
2150.447 0.7578899 1032.456 Fold03
3283.720 0.8653957 1127.074 Fold04
2460.395 0.5779406 1101.971 Fold05
2323.728 0.8548676 1108.510 Fold06
2098.549 0.6853084 1037.469 Fold07
2438.451 0.7206981 1118.911 Fold08
2288.554 0.6017768 1087.111 Fold09
2204.858 0.5644986 1074.070 Fold10
2440.888 0.6608190 1093.037 Fold11
2182.560 0.6059752 1051.069 Fold12
2231.308 0.5887032 1087.412 Fold13
2091.932 0.6379864 1037.476 Fold14
2276.017 0.5724834 1065.731 Fold15
2268.223 0.6572709 1081.702 Fold16
2195.368 0.6269748 1042.691 Fold17
2120.611 0.6376369 1045.987 Fold18
2069.979 0.6697332 1025.584 Fold19
2146.724 0.6361837 1052.295 Fold20

Since each test set is randomly selected, looking at the mean, standard deviation and histogram of the R squared values from cross-validation tells us how generalizable the model's goodness of fit is.

CVresults <-
  lmFit$resample %>%
  as.data.frame()
CVresults2 <-
  data.frame(TEST = "CROSS_VALIDATION",
             MEAN_RSQUARED = mean(CVresults$Rsquared),
             STANDARD_DEVIATION_RSQUARED = sd(CVresults$Rsquared))

Here is a table showing the mean and standard deviation of R squared. Since the standard deviation is only about 0.08, the model appears reasonably generalizable.

kable(CVresults2)
TEST MEAN_RSQUARED STANDARD_DEVIATION_RSQUARED
CROSS_VALIDATION 0.6637423 0.0849464

Here is the histogram showing the cross-validation R squared across the 20 folds. We can see that the R squared values range from roughly 0.56 to 0.87.

ggplot(CVresults, aes(Rsquared)) + 
  geom_histogram(bins=20,
                 col="grey", 
                 fill="darkorange") +
  labs(title="CROSS-VALIDATION R-SQUARED",
       x="R-Squared Value",
       y="Frequency") + 
  theme(panel.grid.major = element_line(colour = "#626262"),
        panel.grid.minor = element_line(colour = "#626262")) +
  plotTheme

Web App Demo

In this part, I will demonstrate how metro users can use my MTA congestion forecast web app to help them make informed decisions when taking the subway.

When users open the app, they will see a sidebar allowing them to select a specific time and station, a map showing the congestion score and level of each station, and a navigation bar allowing them to switch from map view to table view to see the prediction results in a data table. The app is designed with minimal elements but allows easy and sufficient interaction.

Users can first pick the month in which they want to take the subway. Currently only July and August can be selected, since those are the months covered by my OLS prediction model; ideally, future versions would let users check the predicted congestion level in all 12 months. As soon as users select a month, the map updates automatically.

After selecting the month, users can specify a day. Again, every time users make a new selection, the map will update accordingly.

Users should then pick an hour at which they want to take the subway. Note that due to a limitation of the original turnstile data (entries and exits are recorded only every 4 hours), the app does not have prediction results for every hour of the day.
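One simple way to handle this limitation is to snap a requested hour to the nearest available 4-hour reporting interval. The helper below is a hypothetical sketch, not the app's actual code:

```r
# Snap an arbitrary hour (0-23) to the nearest 4-hour turnstile
# reporting interval. Hypothetical helper illustrating the idea.
report_hours <- c(0, 4, 8, 12, 16, 20)

nearest_report_hour <- function(hour) {
  report_hours[which.min(abs(report_hours - hour))]
}

nearest_report_hour(9)   # snaps to 8
nearest_report_hour(15)  # snaps to 16
```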

Lastly, users should pick the station where they want to take the subway and press the “Zoom to Selected Station” button. The map will zoom in and center on the selected station, so users can easily check its congestion level at the time they chose.

By clicking each station, users can view the congestion score computed for the station.

Furthermore, users can switch from the map view to the result table view. The result table shows the prediction results for all stations at the specified time, including predicted net entries, congestion score, and congestion level. Users can choose how many entries are displayed on each page.

If users want to see the full results for only the selected station, they can search for it in the search bar at the upper right, and the table will update to show only that station's results.

Conclusion and Limitations

In general, my regression model does a good job of predicting the overall temporal and spatial trends of hourly subway net entries at each MTA station in July and August 2017, and it is generalizable, with a consistent goodness of fit across folds. However, the model needs further enhancement: it underpredicts net entries and explains only about two-thirds of the variation in the dependent variable (mean cross-validated R-squared of 0.66). In the future, I plan to explore other model types and machine learning algorithms, such as support vector machines or non-linear regression models.

One challenge in computing the congestion score and determining the congestion level is that there are limited data and few precedents online. I came up with a way to calculate the congestion score based on prior knowledge and the available data, but there are certainly more factors to consider, and better congestion models could help determine the congestion level at subway stations. Other factors I considered but could not find data for include the floor area of each station and the specific time schedules of subway line operations.

The Shiny web application can also be improved by expanding the prediction dataset. Instead of having users pick the month, day, and hour separately, the process could be simplified to selecting just a date and an hour, or to simply using the system time. That said, the current interactions are simple and the results are clear, and the web app can serve as a prototype for future reference.

References

Literature

Eagleburger, A. (2016, July 08). Crowd control: Simulating congestion in the D.C. Metro. Retrieved from https://mobilitylab.org/2016/07/05/crowd-control-simulating-congestion-d-c-metro/

Kim, J. (2016). Subway Congestion Prediction and Recommendation System using Big Data Analysis. Journal of Digital Convergence, 14(11), 289-295. doi:10.14400/jdc.2016.14.11.289

Coding Support

Carl, Raymond. Analyze the New York Subway System. Retrieved from https://rpubs.com/mevanoff24/107775

Carl, Raymond. MTA Subway Data v2. Retrieved from http://rpubs.com/raymondcarl/92744

 
